Finding Centuries-Old Hyperlinks: a Novel Semi-Supervised Shape Classifier

Authors

  • Xiaoyue Wang
  • Eamonn Keogh
Abstract

Hyperlinks are so useful for searching and browsing modern digital collections that researchers have long wondered whether it is possible to retroactively add hyperlinks to digitized historical documents. There has already been significant research into this endeavor for historical text; however, in this work we consider the problem of adding hyperlinks among graphic elements. While such a system would not have the ubiquitous utility of text-based hyperlinks, as we will show, there are several domains where it can significantly augment textual information. While OCR of historical text is known to be a difficult problem, the actual words themselves are inherently discrete. Thus, two words are either identical or not. This means that off-the-shelf machine learning algorithms, including semi-supervised learning, can be easily used. However, as we shall demonstrate, semi-supervised learning does not work well with images, because we cannot expect binary matching decisions. Rather, we must deal with degrees of matching. In this work we make the novel observation that this “degree of matching” biases algorithms toward overly confident predictions about simple shapes. We show a simple technique for correcting this bias, and demonstrate through extensive experiments that our method significantly improves accuracy on diverse historical image collections.

Keywords: Historical Manuscripts, Hyperlinks, Semi-Supervised Learning

Similar resources

Semi-Supervised Learning Based Prediction of Musculoskeletal Disorder Risk

This study explores a semi-supervised classification approach using random forest as a base classifier to classify the low-back disorders (LBDs) risk associated with the industrial jobs. Semi-supervised classification approach uses unlabeled data together with the small number of labelled data to create a better classifier. The results obtained by the proposed approach are compared with those o...

Semi-supervised Learning Using an Unsupervised Atlas

In many machine learning problems, high-dimensional datasets often lie on or near manifolds of locally low-rank. This knowledge can be exploited to avoid the “curse of dimensionality” when learning a classifier. Explicit manifold learning formulations such as lle are rarely used for this purpose, and instead classifiers may make use of methods such as local co-ordinate coding or auto-encoders t...

Semi-supervised learning for improved expression of uncertainty in discriminative classifiers

Seeking classifier models that are not overconfident and that better represent the inherent uncertainty over a set of choices, we extend an objective for semi-supervised learning for neural networks to two models from the ratio semi-definite classifier (RSC) family. We show that the RSC family of classifiers produces smoother transitions between classes on a vowel classification task, and that ...

Discriminative Similarity for Clustering and Semi-Supervised Learning

Similarity-based clustering and semi-supervised learning methods separate the data into clusters or classes according to the pairwise similarity between the data, and the pairwise similarity is crucial for their performance. In this paper, we propose a novel discriminative similarity learning framework which learns discriminative similarity for either data clustering or semi-supervised learning...

Minimally-supervised methods for Arabic Named Entity Recognition

Supervised methods can achieve high performance on NLP tasks, such as Named Entity Recognition (NER), but new annotations are required for every new domain and/or genre change. This has motivated research in minimally supervised methods such as semi-supervised learning and distant learning, but neither technique has yet achieved performance levels comparable to those of supervised methods. Semi-...

Publication date: 2009